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And  now  I see  with  eye  serene 
The  very  pulse  of  the  machine. 

— William  Wordsworth 


Abstract 


A systolic  system  is  a network  of  processors  which  rhythmically  compute  and  pass 
data  through  the  system.  Physiologists  use  the  word  "systole"  to  refer  to  the 
rhythmically  recurrent  contraction  of  the  heart  and  arteries  which  pulses  blood 
through  the  body.  In  a systolic  computing  system,  the  function  of  a processor  is 
analogous  to  that  of  the  heart.  Every  processor  regularly  pumps  data  in  and  out, 
each  time  performing  some  short  computation,  so  that  a regular  flow  of  data  is  kept 
up  in  the  network. 

Many  basic  matrix  computations  can  be  pipelined  elegantly  and  efficiently  on 
systolic  networks  having  an  array  structure.  As  an  example,  hexagonally  connected 
processors  can  optimally  perform  matrix  multiplication.  Surprisingly,  a similar 
systolic  array  can  compute  the  LU-decomposition  of  a matrix.  These  systolic  arrays 
enjoy  simple  and  regular  communication  paths,  and  almost  all  processors  used  in  the 
networks  are  identical.  As  a result,  special  purpose  hardware  devices  based  on 
systolic  arrays  can  be  built  inexpensively  using  the  VLSI  technology. 


1.  Introduction 


Developments  in  microelectronics  have  revolutionized  computer  design.  Integrated 
circuit  technology  has  increased  the  number  and  complexity  of  components  that  can 
fit  on  a chip  or  a printed  circuit  board.  Component  density  has  been  doubling  every 
One -to- two  years  and  already,  a multiplier  can  fit  on  a very  large  scale  integrated 
(VLSI)  circuit  chip.  As  a result,  the  new  technology  makes  it  feasible  to  build 
low-cost  special  purpose,  peripheral  devices  to  rapidly  solve  sophisticated  problems. 
Reflecting  the  changing  technology,  this  paper  proposes  new  multiprocessor 
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structures  for  processing  some  basic  matrix  computations. 


We  are  interested  in  high-performance  parallel  structures  that  can  be 
implemented  directly  as  low-cost  hardware  devices.  By  performance,  we  are  not 
refering  to  the  traditional  operation  counts  that  characterize  classical  analyses  of 
algorithms,  but  rather,  the  throughput  obtainable  when  a special  purpose  peripheral 
device  is  attached  to  a general  purpose  host  computer.  This  implies  that  time  spent 
in  1/0,  control,  and  data  movement  as  well  as  arithmetic  must  all  be  considered.  VLSI 
offers  excellent  opportunities  for  inexpensive  implementation  of  high  performance 
devices  (Mead  and  Conway  [1978]).  Thus,  in  this  paper  the  cost  of  a device  will  be 
determined  by  the  expense  of  a VLSI  implementation.  Tit  the  job  to  the  bargain 
components"  — Blakeslee  [1975,  p.  4). 

VLSI  technology  has  made  one  thing  clear.  Simple  and  regular  interconnections 
lead  to  cheap  implementations  and  high  densities,  and  high  density  implies  both  high 
performance  and  low  overhead  for  support  components.  (Sutherland  and  Mead 
[1977]  has  a good  discussion  on  the  importance  of  having  simple  and  regular 
geometries  for  data  paths.)  For  these  reasons,  we  are  interested  in  designing 
multiprocessor  structures  which  have  simple  and  regular  communication  paths.  We 
are  also  interested  in  employing  pipelining  as  a general  method  for  using  these 
structures.  By  pipelining,  computation  may  proceed  concurrently  with  input  and 
output,  and  consequently  overall  execution  time  is  minimized.  Pipelining  plus 
multiprocessing  at  each  stage  of  a pipeline  should  lead  to  the  best -possible 
performance. 

Systolic  systems  provide  a realistic  model  of  computation  which  captures  the 
concepts  of  pipelining,  parallelism  and  interconnection  structures.  We  do  not  want  to 
give  a formal  definition  of  systolic  systems  here.  For  the  purpose  of  this  paper,  it 
suffices  to  view  a systolic  system  as  a network  of  processors  which  rhythmically 
compute  and  pass  data  through  the  system.  The  analogy  is  to  the  rhythmic 
contraction  of  the  heart  which  pulses  blood  through  the  circulatory  system  of  the 
body.  Each  processor  in  a systolic  network  can  be  thought  of  as  a heart  that  pumps 
multiple  streams  of  data  through  itself.  The  regular  beating  of  these  parallel 
processors  keeps  up  a constant  flow  of  data  throughout  the  entire  network.  As  a 
processor  pumps  data  items  through,  it  performs  some  constant -time  computation 
and  may  update  some  of  the  items. 

Unlike  the  closed-loop  circulatory  system  of  the  body,  a systolic  computing  system 
usually  has  ports  into  which  inputs  flow,  and  ports  where  the  results  of  the  systolic 
computation  are  retrieved.  Thus  a systolic  system  can  be  a pipelined  system  - input 
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and  output  occur  with  every  pulsation.  This  makes  (hem  attractive  as  peripheral 
processors  attached  to  the  data  channel  of  a host  computer.  Figure  1-1  illustrates 
how  a special  purpose  systolic  device  might  form  a part  of  a PDP-11  system.  A 
systolic  device  may  also  process  a real-time  data  stream  or  be  a component  in  a 
larger  special  purpose  system. 


Figure  1-1:  A systolic  device  connected  to  the  UNIBUS  of  a PDP-1 1. 


This  paper  deals  largely  with  systolic  systems  where  the  underlying  network  is 
■"•y  structured.  (See  also  Kung  and  Leiserson  [1978J.)  An  array  network  is 
attractive  for  it  enjoys  simple  and  regular  communication  paths.  In  Section  2,  we 
describe  the  basic  hardware  requirements  and  interconnection  schemes  for  the 


systolic  arrays  proposed  and  discuss  the  feasibility  of  building  them  in  VISL  Section 
3 deals  with  the  matrix-vector  multiplication  problem.  Multiplication  of  two  matrices 
is  considered  in  Section  4.  In  Section  5,  we  show  that  essentially  the  same  systolic 
arrays  for  matrix  multiplication  in  Section  4 can  be  used  to  find  the 
LU-decomposition  of  a matrix.  Section  6 is  concerned  with  solving  triangular  linear 
systems.  We  show  that  this  problem  can  be  solved  by  almost  the  same  systolic 
array  for  matrix-vector  multiplication  described  in  Section  3.  Section  7 discusses 
applications  and  extensions  of  the  results  presented  in  the  previous  sections.  The 
applications  include  the  computations  of  finite  impulse  response  filters,  convolutions, 
and  discrete  Fourier  transforms.  Some  concluding  remarks  are  given  in  the  last 


section. 
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The  size  of  each  of  our  systolic  array  networks  is  dependent  only  on  the  band 
width  of  the  band  matrix  to  be  processed,  and  is  independent  of  the  length  of  the 
band.  Thus,  a fixed  size  systolic  array  can  pipeline  band  matrices  with  arbitrarily 
long  bands.  The  pipelining  aspect  of  our  arrays  is,  of  course,  most  effective  for 
band  matrices  with  long  bands.  Band  matrices  are  interesting  in  their  own  right, 
since  many  important  scientific  computations  involve  band  matrices.  For  these 
reasons,  most  of  the  results  in  this  paper  will  be  presented  in  terms  of  their 
applications  to  band  matrices.  All  the  results  apply  to  dense  matrices  since  a dense 


matrix  can  be  viewed  as  a band  matrix  having  the  maximum-possible  band  width. 


2.  The  Basic  Components  and  Systolic  Array  Structures 


2.1  The  Inner  Product  Step  Processor 

The  single  operation  common  to  all  the  computations  considered  in  this  paper  is 
the  so-called  inner  product  step,  C «-  C ♦ A x B.  We  postulate  a processor  which 
has  three  registers  RA,  Rg,  and  Rq.  Each  register  has  two  connections,  one  for  inpul 
and  one  for  output.  Figure  2-1  shows  two  types  of  geometries  for  this  processor. 
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Figure  2—1 : Geometries  for  the  inner  product  step  processor. 

Type  (a)  geometry  will  be  used  for  matrix-vector  multiplication  and  solution  of 
triangular  linear  systems  (Sections  3 and  6),  whereas  type  (b)  geometry  will  be  used 
for  matrix  multiplication  and  LU-decomposition  (Sections  4 and  5).  The  processor  is 
capable  of  performing  the  inner  product  step  and  is  called  the  inner  product  step 
processor.  We  shall  define  a basic  time  unit  in  terms  of  the  operation  of  this 
processor.  In  each  unit  time  interval,  the  processor  shifts  the  data  on  its  input  lines 
denoted  by  A,  B and  C into  R^,  Rg  and  Rq,  respectively,  computes 
Rq  4-  Rq  ♦ RA  x Rg,  and  makes  the  input  values  for  R^  and  Rg  together  with  the 
new  value  of  Rq  available  as  outputs  on  the  output  lines  denoted  by  A,  B and  C, 
respectively.  All  outputs  are  latched  and  the  logic  is  clocked  so  that  when  one 
processor  is  connected  to  another,  the  changing  output  of  one  during  a unit  time 
interval  will  not  interfere  with  the  input  to  another  during  this  time  interval.  This  is 
not  the  only  processing  element  we  shall  make  use  of,  but  it  will  be  the  work  horse. 
A special  processor  for  performing  division  will  be  specified  later  when  it  is  used. 
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2.2  Systolic  Arrays 


A systolic  device  is  typically  composed  of  many  interconnected  inner  product  step 
processors.  The  basic  network  organization  we  shall  adopt  is  the  mesh-connected 
scheme  in  which  all  connections  from  a processor  are  to  neighboring  processors. 
(See  Figure  2-2.) 


(c)  hexagonally  connected 


Figyre  2-2:  Mash-connacted  systolic  arrays. 

The  most  widely  known  system  based  on  this  organization  is  the  ILLIAC  IV  (Barnes 
at  a L [1968]).  If  diagonal  connections  are  added  in  one  direction  only,  we  shall  call 
the  resulting  scheme  hexagonally  mesh-connected  or  hex-connected  for  short.  We 
shall  demonstrate  that  linearly  connected  and  hex-connected  arrays  are  natural  for 
matrix  problems. 

Processors  lyinj  on  the  boundary  of  the  systolic  array  may  have  external 
connections  to  the  host  memory.  Thus,  an  input/output  data  path  of  a boundary 
processor  may  sometimes  be  designated  as  an  external  input/output  connection  for 
the  device.  A boundary  processor  may  receive  input  from  the  host  memory  through 
such  an  external  connection,  or  it  may  receive  a fixed  value  such  as  zero.  On  the 
other  hand,  a boundary  processor  can  send  data  to  the  host  memory  through  an 
external  output  connection.  An  output  of  a boundary  processor  may  sometimes  be 
ignored.  This  will  be  designated  by  omitting  the  corresponding  output  line. 
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In  this  paper  we  assume  that  the  processors  in  a systolic  array  are  synchronous 
as  described  in  Section  2.1.  However,  it  is  possible  to  view  the  processors  being 
asynchronous,  each  computing  its  output  values  when  all  its  inputs  are  available,  as 
in  a data  flow  model.  For  the  results  of  this  paper  we  believe  the  synchronous 
approach  to  be  more  direct  and  intuitive. 

The  hardware  demands  of  the  systolic  arrays  in  this  paper  are  readily  seen  to  be 
modest.  The  processing  elements  are  uniform,  interprocessor  connections  are  simple 
and  regular,  and  external  connections  are  minimized.  It  is  our  belief  that 
construction  of  these  systolic  arrays  will  prove  to  be  cost-effective  using,  for 
instance,  the  modern  VLSI  technology. 


3.  Matrix-Vector  Multiplication  on  a Linear  Systolic  Array 

We  consider  the  problem  of  multiplying  a matrix  A - (ajj)  with  a vector 

x - (x1,-.,xn)T.  The  elements  in  the  product  y - (y  »yn)"^  can  be  computed  by  the 

following  recurrences. 


0, 


yfk*l)m 


y(k)  ♦ »*** 


y> 


y(n+I). 


Suppose  A is  an  nxn  band  matrix  with  band  width  w - p+q-1.  (See  Figure  3-1  for 
the  case  when  p - 2 and  q • 3.)  Then  the  above  recurrences  can  be  evaluated  by 
pipelining  the  Xj  and  yt  through  a systolic  array  consisting  of  w linearly  connected 
inner  product  step  processors.  We  illustrate  the  operation  of  the  systolic  array  for 
the  band  matrix-vector  multiplication  problem  in  Figure  3-1.  For  this  case  the 
linearly  connected  systolic  array  has  four  inner  product  step  processors.  See 
Figure  3-2. 


The  general  scheme  of  the  computation  can  be  viewed  as  follows.  The  yt,  which 
are  initially  zero,  are  pumped  lo  the  left  while  the  x;  are  pumped  to  the  right  and 
the  ajj  are  marching  down.  (For  the  general  problem  of  computing  Ax*d  where 
d-fd^-.d,,)'  is  any  given  vector,  y;  should  be  initialized  as  dj).  All  the  moves  are 
synchronized.  It  turns  out  that  each  yj  is  able  to  accumulate  all  its  terms,  namely,  aj 
i-2Xj_2*  ■i,j-l*i-l*  •jti*j  before  it  leaves  the  network.  Figure  3-3 

illustrates  the  first  seven  pulsations  of  the  systolic  array.  4iote  that  when  y j and  y2 
are  output  they  have  the  correct  values.  Observe  also  that  at  any  given  time 
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y,  , Initialized  as  zero, 
is  pumped  into  the  fourth 
processor. 


x,  is  pumped  into  the  first 
processor  while  y,  is  moved 
left  one  place.  (From  now 
on  the  x(  and  y(  keep  moving 
right  and  left,  respectively.) 


a„  enters  the  second 
processor  where  y,  is 
updated  by  y,  *-  y,  + a„  *,  • 
Thus  y,  - an  x,. 
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a(1and  ax;  enter  the  first 
and  third  processors, 
respectively,  y,  ■ a„x,+ 
and  a^|X| . 


y,  is  pumped  out 
y2  • Vl+  ^ 

* " V.‘ 


^IXI+ 

a3.x.+ 
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y2  is  pumped  out. 

y,*  “».x«+  v1**  w 
y*“  s^x- 


Figure  3*3:  The  first  seven  pulsations  of  the  linear  systolic  array  in  Figure  3-2. 

i alternate  processors  are  idle.  Indeed,  by  coalescing  pairs  of  adjacent  processors,  it 

ie  possible  to  use  w/2  processors  in  the  network  for  a general  band  matrix  with 
band  width  w. 

I 
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We  now  specify  the  operation  of  the  ystolic  array  more  precisely.  Assume  that 
the  processors  are  numbered  by  integers  1,  2,  . . w from  the  left  end  processor  to 
the  right  end  processor.  Each  processor  has  three  registers,  RA,  Rx  and  Ry,  which 
will  hold  entries  in  A,  x and  y,  respectively.  Initially,  all  registers  contain  zeros. 
Each  pulsation  of  the  systolic  array  consists  of  the  following  operations,  but  for  odd 
numbered  pulses  only  odd  numbered  processors  are  activated  and  for  even 
numbered  pulses  only  even  numbered  processors  are  activated. 

1.  Shift. 

- Ra  gets  a new  element  in  the  band  of  matrix  A. 

- Rx  gets  the  contents  of  register  Rx  from  the  left  neighboring 
node.  (The  Rx  in  processor  1 gets  a new  component  of  x.) 

- Ry  gets  the  contents  of  register  R from  the  right  neighboring 
node.  (Processor  1 outputs  its  Ry  contents  and  the  Ry  in 
processor  w gets  zero.) 

2.  Multiply  and  Add. 

Ry  ♦-  Ry  ♦ RA  x Rx. 

Using  the  type  (a)  inner  product  step  processor  postulated  in  section  2,  we  note 
that  the  three  shift  operations  in  step  1 can  be  done  simultaneously,  and  that  each 
pulsation  of  the  systolic  array  takes  a unit  of  time.  Suppose  the  bandwidth  of  A is 
w - p*q-l.  It  is  readily  seen  that  after  w units  of  time  the  components  Of  the 
product  y - Ax  are  pumped  out  from  the  left  end  processor  at  the  rate  of  one 
output  every  two  units  of  time.  Therefore,  using  our  systolic  network  all  the  q 
components  of  y.  can  be  computed  in  2n+w  time  units,  as  compared  to  the  O(wn)  time 
needed  for  a sequential  algorithm  on  a uniprocessor  computer. 

4.  Matrix  Multiplication  on  a Hexagonal  Systolic  Array 

This  section  considers  the  problem  of  multiplying  two  nxn  matrices,  it  is  easy  to 
see  that  the  matrix  product  C - (Cjj)  of  A - (ajj)  and  8 - (bjj)  can  be  computed  by 
the  following  recurrences. 
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Let  A and  8 be  nxn  band  matrices  of  band  width  wj  and  Wg,  respectively.  We  show 
how  the  recurrences  above  can  be  evaluated  by  pipelining  the  ay,  by  and  cy 
through  a systolic  array  having  WjW2  hex-connected  inner  product  step  processors. 

‘ We  illustrate  the  general  scheme  by  considering  the  matrix  multiplication  problem 

depicted  in  Figure  4-1.  The  diamond  shaped  systolic  array  for  this  case  is  shown  in 
Figure  4-2,  where  processors  are  hex-connected  and  data  flows  are  indicated  by 
arrows. 


A B C 


Figure  4-1:  Band  matrix  multiplication. 

The  elements  in  the  bands  of  A,  8 and  C are  pumped  through  the  systolic  network  in 
three  directions  synchronously.  Each  cy  is  initialized  to  zero  as  it  enters  the 
network  through  the  bottom  boundaries.  (For  the  general  problem  of  computing 
AB+O  where  D-(dy)  is  any  given  matrix,  cy  should  be  initialized  as  dy.)  One  can 
easily  see  that  with  the  type  (b)  inner  product  step  processors  described  in  Section 
2,  each  Cy  is  able  to  accumulate  all  its  terms  before  it  leaves  the  network  through 
the  upper  boundaries.  Figure  4-3  shows  four  consecutive  pulsations  of  the 
hexagonal  systolic  array.  The  reader  is  invited  to  study  the  data  flow  of  this 
problem  more  closely  by  making  tranparencies  of  the  band  matrices  shown  in  the 
figures,  and  moving  them  over  the  network  picture  as  described  above. 

Let  A and  B be  nxn  band  matrices  of  band  width  w_j_  and  wo.  respectively.  Then  a 
systolic  array  of  w^wo  hex-connected  processors  can  pipeline  the  matrix 
multiplication  AxB  in  3n+min(w^.  w^)  units  of  time.  Note  that  in  any  row  or  column  of 
the  network,  out  of  every  three  consecutive  processors,  only  one  is  active  at  given 
time.  It  is  possible  to  use  about  (WjW2>/3  processors  in  the  network  for  multiplying 
two  band  matrices  with  band  widths  wj  and  W2- 


5.  The  LU-Decomposition  of  a Matrix  on  a Hexagonal  Systolic  Array 

The  problem  of  factoring  a matrix  A into  lower  and  upper  triangular  matrices  L 
and  U is  called  LU-decomposition.  Figure  5-1  illustrates  the  lU-decomposition  of  a 
band  matrix  with  p ■ 4 and  p ■ 4.  Once  the  L and  U factors  are  Known,  it  is 
relatively  easy  to  invert  A or  solve  the  linear  system  Ax  ■ b.  We  deal  with  the 
latter  problem  in  section  6.  This  section  describes  a hexagonal  systolic  array  for 
computing  LU-decompositions. 
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Figure  5-1:  The  LU-decomposition  of  a band  matrix. 

We  assume  that  matrix  A has  the  property  that  its  Ill-decomposition  can  be  done 
by  Gaussian  elimination  without  pivoting.  (This  is  true,  for  example,  when  A is  a 
symmetric  positive-definite,  or  an  irreducible,  diagonally  dominant  matrix.)  The 
triangular  matrices  L - (Ijj)  and  U - (u,j)  are  evaluated  according  to  the  following 
recurrences. 
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We  show  that  the  evaluation  of  these  recurrences  can  be  pipelined  on  a 
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hex-connected  systolic  array  o f hex-connected  processors.  A global  view  of  this 
pipelined  computation  is  shown  in  Figure  5-2  for  the  LU-decomposition  problem 
depicted  in  Figure  5-1.  The  systolic  array  in  Figure  5-2  is  constructed  as  follows. 
The  processors  below  the  upper  boundaries  are  the  standard  type  (b)  inner  product 
step  processors  and  are  hex-connected  exactly  same  as  the  matrix  multiplication 
network  presented  in  Section  4.  The  processor  at  the  top,  denoted  by  a circle,  is  a 
special  processor.  It  computes  the  reciprocal  of  its  input  and  pumps  the  result 
southwest,  and  also  pumps  the  same  input  northward  unchanged.  The  other 
processors  on  the  upper  boundaries  are  again  type  (b)  inner  product  step 
processors,  but  their  orientation  is  changed:  the  ones  on  the  upper  left  boundary 
are  rotated  120  degrees  clockwise:  the  ones  on  the  upper  right  boundary  are 
rotated  120  degrees  counterclockwise. 

The  flow  of  data  on  the  systolic  array  is  indicated  by  arrows  in  the  figure.  As  in 
the  hexagonal  systolic  array  for  matrix  multiplication  , each  processor  only  operates 
every  third  time  pulse.  Figure  5-3  illustrates  four  consecutive  pulsations  of  the 
systolic  array.  Note  that  in  the  figure,  because  A is  a band  matrix  with  p - 4 and 
q - 4 we  have  that  a[*^j-  *j*3,i  »nd  af,*fl3“  ai,i+3  ,or  1 S k S i and  i St  2.  Thus  ag2» 
for  example,  can  be  viewed  as  aj^  when  it  enters  the  network. 

There  are  several  equivalent  systolic  arrays  that  reflect  only  minor  changes  to 
the  network  presented  in  this  section.  For  example,  the  elements  of  L and  U can  be 
retrieved  as  output  in  a number  of  different  ways.  Also,  the  '-1*  input  to  the 
network  can  be  changed  to  a %1"  if  the  special  processor  at  the  top  of  the  network 
computes  minus  the  reciprocal  of  its  input. 

If  A ii  an  nxn  band  matrix  with  band  width  w - p+o-1.  £ systolic  array  having  no 
more  than  pq  hcx-connectcd  processors  can  compute  the  LU-decomposition  of  A in 
3n+min(p.q)  units  of  time.  If  A is  an  nxn  dense  matrix,  this  means  that  n^ 
hex-connected  processors  can  compute  the  L and  U matrices  in  4n  units  of  time 
which  includes  I/O  time. 

The  remarkable  fact  that  the  matrix  multiplication  network  forms  a major  part  of 
the  LU-decomposition  network  is  due  to  the  similarity  of  their  defining  recurrences. 
In  any  row  or  column  of  the  LU-decomposition  systolic  array,  only  one  out  of  every 
three  consecutive  processors  is  active  at  a given  time.  As  we  observed  for  matrix 
multiplication,  the  number  of  processors  can  be  reduced  to  about  pq/3. 


,r*j 


Figure  5-2:  The  hex-connected  systolic  array  for  pipelining  the  LU-decompo«iUon 

of  the  band  matrix  in  Figure  5-1. 


6.  Solving  a Triangular  Linear  System  on  a Linear  Systolic  Array 


Suppose  that  we  want  to  solve  a linear  system  Ax  - b.  Then  after  having  done 
the  LU-decomposition  of  A (e.g.,  by  methods  described  in  Section  5),  we  still  have  to 
solve  two  triangular  linear  systems  Ly  - b and  Ux  - y.  This  section  concerns  itself 
with  the  solution  of  triangular  linear  systems.  An  upper  triangular  linear  system  can 


always  be  rewritten  as  a lower  triangular  linear  system.  Without  loss  of  generality, 
this  section  deals  exclusively  with  lower  triangular  linear  systems. 


Let  A ■ (ajj)  be  a nonsingular  nxn  band  lower  triangular  matrix.  Suppose  that  A 
and  an  n-vector  b - (bj^..,bn)T  are  given.  The  problem  is  to  compute  x - (xj,_,xn)T 
such  that  Ax  - b.  The  vector  x can  be  computed  by  forward  substitution: 


! 

! 


yM  + ajkxk, 

*i  - (bj-yj'b/ajj. 


Suppose  that  A is  a band  matrix  with  band  width  w ■ q.  (See  Figure  6-1  for  the 
case  when  q - (I.)  Then  the  above  recurrences  can  be  evaluated  by  a systolic  array 
similar  to  that  used  for  band  matrix-vector  multiplication  in  Section  3.  (Observe  the 
similarity  of  the  defining  recurrences  for  these  two  problems.)  We  illustrate  our 
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result  by  considering  the  linear  system  problem  in  Figure  6-1.  For  this  case,  the 
systolic  array  is  described  in  Figure  6-2. 


Figure  6-2:  The  linearly  connected  systolic  array  lor  solving 
the  triangular  linear  system  in  Figure  6-1. 


j 


The  yj,  which  are  initially  zero,  are  lorced  leftward  through  the  systolic  array 


while  the  xj,  a}j  and  b{  are  pumped  as  indicated  in  Figure  6-2.  The  left  end 
processor  is  special  in  that  it  performs  XjMbj-yjl/ajj.  (In  fact,  the  special  processor 
introduced  in  section  5 to  solve  the  LU-decomposition  problem  is  a special  case  of 


this  more  general  processor.)  Each  yj  accumulates  inner  product  terms  in  the  rest  of 
the  processors  as  it  moves  to  the  left.  At  the  time  yj  reaches  the  left  end  processor 
it  has  the  value  aj|xi ■fai2x2+"‘+ai,i-lxi-l’  ant*»  consequently,  the  Xj  computed  by 
Xj«-(bj-yj)/ajj  at  the  processor  will  have  the  correct  value.  Figure  6-3  demonstrates 
the  first  seven  pulsations  of  the  systolic  array.  From  the  figure  one  can  check  that 
the  final  values  of  xj,  X£,  X3  and  x4  are  all  correct.  With  this  systolic  array  we  can 
solve  an  nxn  band  triangular  linear  system  with  band  width  w - q in  2n+q  units  of 
time.  As  we  observed  for  the  matrix-vector  multiplication  problem,  the  number  of 
processors  required  by  the  array  can  be  reduced  to  w/2. 

7.  Applications  and  Comments 


7.1  Variants  of  the  Systolic  Array 

If  more  information  is  available  about  the  specific  matrices  involved,  an  optimized 
version  of  the  systolic  arrays  presented  above  can  be  used.  It  is  important  that  the 
reader  understands  the  basic  principles  so  that  he  can  construct  appropriate 
variants  for  his  specific  problems.  No  attempt  is  made  here  to  list  ail  the  possible 
variants. 

As  pointed  out  in  Section  1,  although  most  of  our  illustrations  are  of  band 
matrices,  all  the  systolic  arrays  work  for  regular  nxn  dense  matrices.  In  this  case 
the  band  width  of  the  matrix  is  w * 2n-l.  If  the  band  width  of  a matrix  is  so  large 
that  it  requires  more  processors  than  a given  array  provides,  then  one  should 
decompose  the  matrix  and  solve  each  subproblem  on  the  network.  Thus,  for 
example,  the  matrix  multiplication  of  two  nxn  matrices  or  the  LU-decomposition  of  an 
nxn  matrix  can  be  done  in  0(n^/k^)  time  on  a kxk  systolic  array. 

One  can  often  reduce  the  number  of  processors  required  by  a systolic  array  if  the 
matrix  is  known  to  be  sparse  or  symmetric.  For  example,  the  matrices  arising  from  a 
set  of  finite  differences  or  finite  elements  approximations  to  differential  equations 
are  usually  ’sparse  band  matrices’.  These  are  band  matrices  whose  nonzero  entries 
appear  only  in  a few  of  those  lines  in  the  band  which  are  parallel  to  the  diagonal.  In 
this  case  by  introducing  proper  delays  to  each  processor  for  shifting  its  data  to  its 
neighbors,  the  number  of  processors  required  by  the  systolic  array  in  Section  3 can 
be  reduced  to  the  number  of  those  diagonals  which  contain  nonzero  entries.  This 
variant  is  useful  for  performing  iterative  methods  involving  sparse  band  matrices. 
Another  example  concerns  the  LU-decomposition  problem  considered  in  Section  5.  If 
matrix  A is  symmetric  positive  definite,  then  it  is  possible  to  use  only  the  left  portion 
of  the  hex-connected  network,  since  in  this  case  U is  simply  DL^  where  0 is  the 


21 


Pulse 

Number 


Configuration 


Comments 


0 


1 


2 


3 


4 


S 


s 


7 


8 


9 


y,  enters  processor  4. 


y,  moves  left  one  position. 


y2  enters  processor  4. 


x.  ■ o>»-  y. 

(*,  " bi/*u  * *tQC«  7i  * 0.) 


y«.  “ *«  *i  • 


XZ  • 0>r  - 7 1.)/**** 
y3  " **i  X|  • 


y3  ” *Si  xi+a3ix*.* 

7im  Nix,  • 


x,  Is  pumped  out 

x3  "(V  yj>/*»- 

y*<  “ *vi  *»+ 


yy  “ *y,x,+  XVIXJ+  Six3* 

y5  “ ‘sixi* 


xtis  pumped  out. 
x€  - (b^  - yv)/«w. 

y5  “ *M.X<L  + *53  XJ* 


Figure  6-3»  Solving  a lower  band  triangular  system. 
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diagonal  matrix  (a^^)). 

The  optimal  choice  of  the  size  of  the  systolic  network  to  solve  a particular 
problem  depends  upon  not  only  the  problem  but  also  the  memory  bandwidth  to  the 
host  computer.  For  achieving  high  performance,  it  is  desirable  to  have  as  many 
processors  as  possible  in  the  network,  provided  they  can  all  be  kept  busy  doing 
useful  computations. 

It  is  possible  to  use  ouo  systolic  arrays  to  solve  some  nonnumerical  problems 
when  appropriate  interpretations  are  given  to  the  addition  (•*•)  and  multiplication  (x) 
operations.  For  example,  some  pattern  matching  problems  can  be  viewed  as  matrix 
problems  with  comparison  and  Boolean  operations.  It  is  possible  to  store  a 
dynamically  changing  data  structure  in  a systolic  array  so  that  an  order  statistic  can 
always  be  determined  in  constant  time.  We  shall  report  these  results  in  a future 
paper.  It  can  be  instructive  to  view  the  ♦ and  x operations  as  operations  in  an 
abstract  algebraic  structure  such  as  a semiring  and  then  to  examine  how  our  results 
hold  in  such  an  abstract  setting. 

7.2  Convolution,  Filter,  and  Discrete  Fourier  Transform 

There  are  a number  of  important  problems  which  can  be  formulated  as 
matrix-vector  multiplication  problems  and  thus  can  be  solved  rapidly  by  the  systolic 
array  in  Section  3.  The  problems  of  computing  convolutions,  finite  impulse  response 
(FIR)  filters,  and  discrete  Fourier  transforms  are  such  examples.  If  a matrix  has  the 
property  that  the  entries  on  any  line  parallel  to  the  diagonal  are  all  the  same,  then 
the  matrix  is  a Toeplitz  matrix.  The  convolution  problem  is  simply  the  matrix-vector 
multiplication  where  the  matrix  is  a triangular  Toeplitz  matrix  (see  Figure  7-1). 

4 

A p-tap  FIR  filter  can  be  viewed  as  a matrix-vector  multiplication  where  the 
matrix  is  an  band  upper  triangular  Toeplitz  matrix  with  band  width  w - p.  Figure  . 7-2 
represents  the  computation  of  a 4-tap  filter. 

Qn  the  other  hand,  an  n-point  discrete  Fourier  transform  is  the  matrix-vector 
multiplication,  where  the  (i,j)  entry  of  the  matrix  is  «'*"lXj-l)  and  w js  a primitive 
n**1  root  of  unity.  (See  Figure  7-3). 

Therefore  using  a linearly  connected  systolic  array  of  size  n both  the  convolution 
of  two  n-vectors  and  the  n-point  discrete  Fourier  transform  can  be  computed  in  0(n) 
units  of  time,  rather  than  0(n  log  n)  as  required  by  the  sequential  FFT  algorithm. 
Moreover,  note  that  for  the  convolution  and  filter  problems  each  processor  has  to 
receive  an  entry  of  the  matrix  only  once,  and  this  entry  can  be  shipped  to  the 
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Figure  7-3:  The  discrete  Fourier  transform  of  vector  x. 

processor  through  horizontal  connections  and  stay  in  the  processor  during  the  rest 
of  the  computation.  For  the  discrete  Fourier  transform  problem  each  processor  can 
in  fact  generate  on-the-fly  the  powers  of  u it  requires.  As  a result,  for  these  three 
problems  it  is  not  necessary  for  each  processor  in  the  network  to  have  the  external 
input  connection  on  the  top  of  the  processor,  as  depicted  in  Figure  3-2. 

In  the  following  we  describe  how  the  powers  of  u can  be  generated  on-the-fly 
during  the  process  of  computing  an  n-point  discrete  Fourier  transform.  The 
requirement  is  that  if  a processor  is  i units  apart  f^om  the  middle  processor  then  at 
time  i ♦ 2j  the  processor  must  have  the  value  of  fa>J  * ,J  for  all  i,  j.  This  requirement 
can  be  fulfilled  by  using  the  algorithm  below.  We  assume  that  each  processor  has 
one  additional  register  Rj.  All  processors  except  the  middle  one  perform  the 
following  operations  in  each  step,  but  for  odd  (respectively,  even)  numbered  time 
steps  only  processors  which  are  odd  (even)  units  apart  from  the  middle  processor 
are  activated.  For  all  processors  except  the  middle  one  the  contents  of  both  R^  and 
Rf  are  initially  zero. 

1.  Shift.  If  the  processor  is  in  the  left  (respectively,  right)  hand  side  of  the 
middle  processor  then 

- Ra  gets  the  contents  of  register  RA  from  the  right  (respectively, 
left)  neighboring  processor. 

- Rj  gets  the  contents  of  register  Rj  from  the  right  (respectively, 
left)  neighboring  processor. 
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2.  Multiply. 


ra-ra*  Rt- 

The  middle  processor  is  special;  it  performs  the  following  operations  at  every 
ever  numbered  time  step.  For  this  processor  the  contents  of  both  and  Rj  are 
initially  one. 

1.  R*  *"  RA  * Rt^  x w- 

2.  R,  *-  R,  * w. 


7.3  The  Common  Memory  Access  Pattern 

Note  that  all  the  systolic  arrays  given  in  this  paper  store  and  retrieve  elements  of 
the  matrix  in  the  same  order..  (See  Figures  3-2,  4-2,  5-2,  and  6-2.)  Therefore,  we 
recommend  that  matrices  be  always  arranged  in  memory  according  to  this  particular 
ordering  so  that  they  can  be  accessed  efficiently  by  any  of  the  systolic  structures. 


7.4  The  Pivoting  Problem,  and  Orthogonal  Factorization 

In  section  5 we  assume  that  the  matrix  A has  the  property  that  there  is  no  need 
of  using  pivoting  when  Gaussian  elimination  is  applied  to  A What  should  one  do  if  A 
does  not  have  this  nice  property?  (Note  that  Gaussian  elimination  becomes  very 
inefficient  on  mesh-connect  processors  if  pivoting  is  necessary.)  This  question 
motivated  us.  to  consider  Givens'  transformation  (see,  for  example,  Hammering 
[1974])  for  trianguiarizing  a matrix,  which  is  known  to  be  a numerically  stable 
method.  It  turns  out  that,  like  Gaussian  elimination  without  pivoting,  the  orthogonal 
factorization  based  on  Givens'  transformation  can  be  implemented  naturally  on 
mesh-connected  processors,  although  a pipelined  systolic  array  implementation 
appears  to  be  more  complex.  Our  results  on  Givens’  transformation  will  be  reported 
in  another  paper.  (Sameh  and  Kuck  [1978]  considers  parallel  linear  system  solvers 
based  on  Givens’  transformation,  but  they  do  not  give  solutions  to  the  processor 
communication  problem  considered  in  this  paper.) 

8.  Concluding  Remarks 

Systolic  structures  provide  a model  of  computation  for  studying  parallel  algorithms 
for  VLSI.  The  model  takes  into  account  issues  such  as  I/O,  control,  and 
interprocessor  communication.  In  a systolic  system  pipelining  can  overlap  I/O  with 
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computation  to  ensure  high  throughput.  Since  loading  of  data  into  the  network 
occurs  naturally  as  computation  proceeds,  no  extra  control  logic  is  required.  Nor  is 
initialization  logic  needed.  Communication  among  processors  is  through  fixed  data 
paths.  For  a low  cost  and  high  performance  implementation  in  VLSI  (or  even  printed 
circuit  technology),  it  is  desireable  that  these  paths  have  simple  and  regular 
geometries.  These  reasons  make  systolic  arrays  considered  in  this  paper  especially 
attractive.  Indeed,  interconnection  structures  other  than  arrays  exist  which  satisfy 
these  constraints.  Future  work  will  examine  some  of  these  connection  schemes  and 
demonstrate  that  systolic  systems  generalize  beyond  simple  cellular  structures. 

We  have  discovered  that  some  data  flow  patterns  are  fundamental  in  matrix 
computations.  For  example,  the  two-way  flow  on  the  linearly  connected  network  is 
common  to  both  matrix-vector  multiplication  and  solution  of  triangular  linear  systems 
(Sections  3 and  6),  and  the  three-way  flow  on  the  hexagonally  mesh-connected 
network  is  common  to  both  matrix  multiplication  and  lU-decomposition  (Sections  4 
and  5).  A practical  implication  of  this  fact  is  that  one  systolic  device  may  be  used 
for  solving  many  different  problems.  Moreover,  we  note  that  almost  ail  the 
processors  needed  in  any  of  these  devices  are  the  inner  product  step  processor 
postulated  in  Section  2.  A careful  design  of  this  processor  is  desirable  since  it  is 
the  work  horse  for  all  the  devices  presented. 

Research  in  interconnection  networks  and  algorithms  has  been  frequently 
motivated  by  parallel  array  computers  such  as  ILLIAC  IV.  (See,  for  example,  Kuck 
[1968,  1977]  and  Stone  [1975].)  Although  the  results  presented  in  this  paper  were 
motivated  by  the  advance  is  VLSI,  they  reach  beyond.  The  systolic  arrays  in  this 
paper  can  be  implemented  as  efficient  algorithms  on  traditional  parallel  array 
machines. 

For  the  important  problem  of  solving  a dense  system  of  n linear  equations  in  0(n) 
time  on  n^  mesh-connected  processors,  we  have  improved  upon  the  recent  results 
Of  Kant  and  Kimura  [1978].  The  basis  of  their  results  is  an  theorem  on  determinants 
which  was  known  to  J.  Sylvester  in  1851.  Their  algorithm  requires  that  the  matrix 
be  "strongly  nonsingular"  in  the  sense  that  every  square  submatrix  is  nonsingular. 
It  is  sufficient  for  our  algorithms  that  the  matrix  be  symmetric  positive-definite  or 
irreducible  diagonally  dominant. 

Hoare  [1977],  and  Thurber  and  Wald  [1975]  describe  some  matrix  multiplication 
algorithms  on  an  orthogonally  connected  processor  array.  Unlike  our  results,  their 
elgorilhms  require  that  one  or  more  of  the  three  matrices  involved  in  matrix 
multiplication  stay  in  the  array  statically  during  the  computation.  This  introduces 
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overheads  in  I/O  time  and  control  logic  for  loading  the  array  with  the  static  matrix. 
Our  systolic  array  makes  use  of  the  hexagonal  connections  to  pipeline  all  three 
matrices. 

Processor  communication  will  likely  dominate  the  cost  of  parallel  algorithms  and 
systems.  Communication  paths  inherently  require  more  space  and  energy  than 
processing  elements  do.  We  regard  the  problem  of  minimizing  communication  costs 
as  fundamental,  and  we  believe  systolic  structures  provide  models  that  can  bridge 
the  gap  between  theory  and  practice.  Systolic  arrays  can  be  buift  in  VLSL 
Connected  to  a standard  Von  Neumann  computer,  a systolic  device  provides 
inexpensive  but  massive  computation  power. 
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